Abstract: We describe Information Forests, an approach to
classification that generalizes Random Forests by changing the
splitting criterion of non-leaf nodes from a discriminative one,
based on the entropy of the label distribution, to a generative
one, based on maximizing the information divergence between
the class-conditional distributions in the resulting partitions. The
basic  idea  consists  of  deferring  classification  until  a  measure 
of  “classification  confidence”  is  sufficiently  high,  and  instead 
breaking  down  the  data  so  as  to  maximize  this  measure.  In 
an  alternative  interpretation,  Information  Forests  attempt  to 
partition  the  data  into  subsets  that  are  “as  informative  as 
possible”  for  the  purpose  of  the  task,  which  is  to  classify 
the  data.  Classification  confidence,  or  informative  content  of 
the  subsets,  is  quantified  by  the  Information  Divergence.  Our 
approach relates to active learning, semi-supervised learning,
and mixed generative/discriminative learning.

I.  Introduction 

We  introduce  Information  Forests  (IFs),  a  family  of  part- 
based  classifiers  designed  for  problems  that  are  not  easily 
solvable  as  a  whole.  In  IFs  there  is  a  hidden  location  or 
selection  variable  that  is  key  to  performing  classification: 
While  there  may  be  no  distinguishing  characteristic  between 
the  positive  and  negative  samples  considered  as  a  whole, 
one  can  find  “informative  subsets”  (regions,  parts,  or  groups) 
where  classification  is  simple  to  carry  out.  However,  IFs  are 
not  restricted  to  these  problems,  and  can  be  interpreted  as 
a  generic  family  of  classifiers  that  includes  Random  Forests 
(RFs)  as  a  special  case. 

The  motivation  comes  from  problems  such  as  detection  of 
people  in  images,  where  the  distribution  of  intensity  or  color 
values  in  the  region  occupied  by  a  person  is  not  discriminative, 
and  could  be  identical  to  the  distribution  of  intensity  or  color 
values  outside  the  same  region.  However,  when  the  problem 
is  restricted  to  smaller  regions,  or  “parts,”  the  problem  may 
be  more  easily  solved. 

A.  Intuition 

The  key  idea  of  Information  Forests  is  to  defer  attempts 
to  classify  data  points,  and  focus  first  on  grouping  them  in 
a  way  that  makes  classification  as  simple  as  possible.  In 
other  words,  the  goal  at  the  outset  is  not  to  partition  the 
data  into  clusters  that  are  as  “pure”  as  possible  (belonging 
to  the  same  class).  Instead,  the  goal  is  to  partition  the  data 
into  clusters  that  are  as  simple  as  possible  to  classify  down 
the  line,  and  only  perform  the  classification  when  it  becomes 
sufficiently simple. Put differently, the focus is to break
down the original classification problem (for the entire dataset)
into smaller subsets that are as simple as possible to classify.
Only when the classification problem is “simple enough” is it
actually carried out. Otherwise, the grouping process proceeds
in a recursive, hierarchical fashion. In this divide-et-impera
scheme, the goal is to determine groups of data that are as
informative as possible for the purpose of the task, which is
the determination of the class label $\lambda$. Such groups can be
considered “regions” or “parts” or “subsets” depending on the
application. This is illustrated in Fig. 1.


Fig. 1. Random Forest vs. Information Forest. A sequence of n groups
alternating positive/negative/positive/negative etc., partitioned using a Random
Forest with linear stumps, requires a number of levels that grows linearly with
n (left). An Information Forest using the same stumps (right) does not try to
classify samples immediately, but instead tries to partition them into groups that
are simple to classify, and defers the decision until confidence $\tau$ is sufficiently
high and information gain $\delta$ sufficiently small.


B.  Formalization 

Let $\lambda \in \{0, 1\}$ be a binary class label, $x \in D \subset \mathbb{R}^k$,
with $k = 2, 3$, a location variable, and $y : D \to Y$, $x \mapsto y(x)$
a measurement (or “feature”) associated with location $x$,
that takes values in some vector space $Y$. When the domain
$D$ is discretized (e.g., the planar lattice), $x$ can be identified
with an index $i \in \Lambda \mid x_i \in D$. In that case, we indicate $y(x_i)$
simply by $y_i$.

A (binary¹) segmentation problem consists of partitioning
the spatial domain $D$ into two regions, $\Omega$ and $D \setminus \Omega$, according
to the value of the feature $y(x)$. This can be done by
considering  the  posterior  probability 

$$P(\lambda \mid y) \propto p(y \mid \lambda)\, P(\lambda), \qquad (1)$$

where  the  first  term  on  the  right  hand  side  indicates  the 
likelihood,  and  the  second  term  the  location  prior.  It  should 
be  clear  that  meaningfully  solving  this  problem  hinges  on  the 
two likelihoods, $p(y \mid \lambda = 1)$ and $p(y \mid \lambda = 0)$, being different:

$$p(y \mid \lambda = 1) \neq p(y \mid \lambda = 0). \qquad (2)$$

¹ Extension to multi-class segmentation, where $\lambda \in \{1, 2, \ldots, M\}$, is
straightforward and will therefore not be considered here.
If this is the case, we can infer $\lambda$ and, from it, $\Omega = \{x \mid \lambda(x) = 1\}$.
However, there are plenty of examples where (2)
is violated. We refer to problems where the condition (2)
is violated as problems that “are not solvable as a whole”, in
the sense that we cannot segment the spatial domain simply by
comparing statistics inside $\Omega$ to statistics outside. Nevertheless,
it may be possible to determine parts, or local regions $S_j \subset D$,
within which the likelihoods are different:


$$\exists\, S_j \;\mid\; p(y \mid x \in S_j, \lambda = 1) \neq p(y \mid x \in S_j, \lambda = 0), \quad S_j \subset D, \quad j = 1, \ldots, N. \qquad (3)$$

Note that the collection $\{S_j\}$ is not unique and does not need to
form a partition of $D$, as there is no requirement that $S_i \cap S_j = \emptyset$
for $i \neq j$, so long as the union of these regions covers² $D$.
The  regions  Sj  do  not  even  need  to  be  simply  connected. 
In  some  applications,  one  may  want  to  impose  these  further 
conditions. 
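As a toy illustration of a problem that is not solvable as a whole but becomes solvable on a part, the following sketch builds a small hypothetical dataset of (location, feature, label) triples. The data, the helper name `likelihood`, and the choice of the region `x < 2` are assumptions made only for this example.

```python
from collections import Counter

# Hypothetical toy data: (location x, feature y, label lam).
# For x < 2 positives have y = 1 and negatives y = 0;
# for x >= 2 the roles are swapped.
samples = [
    (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0),   # left region
    (2, 0, 1), (3, 0, 1), (2, 1, 0), (3, 1, 0),   # right region
]

def likelihood(data, label):
    """Empirical p(y | lambda = label) over the given samples."""
    ys = [y for _, y, lam in data if lam == label]
    counts = Counter(ys)
    return {y: counts[y] / len(ys) for y in (0, 1)}

# Over the whole domain the two likelihoods coincide, so (2) is violated.
whole_pos = likelihood(samples, 1)
whole_neg = likelihood(samples, 0)

# Restricted to the subset S = {x < 2} they differ, as required by (3).
left = [s for s in samples if s[0] < 2]
left_pos = likelihood(left, 1)
left_neg = likelihood(left, 0)

print("whole:", whole_pos, whole_neg)
print("left: ", left_pos, left_neg)
```

Here the feature alone carries no information about the label, while the same feature restricted to one region separates the classes perfectly.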

In the discrete-domain case, we identify the index $i$ with the
location $x_i$, so the regions become subsets of the data. With
an abuse of notation, we write

$$S_j = \{i_1, i_2, \ldots\}. \qquad (4)$$

Therefore,  we  write  the  two  conditions  (2)-(3)  as 


$$p(y_i \mid \lambda_i = 1) = p(y_i \mid \lambda_i = 0),$$

$$p(y_i \mid i \in S_j, \lambda_i = 1) \neq p(y_i \mid i \in S_j, \lambda_i = 0). \qquad (5)$$


Assuming  these  conditions  are  satisfied,  we  can  write  the 
posteriors  by  marginalizing  over  the  sets  Sj, 

$$p(\lambda \mid y_i) \propto \sum_j p(y_i \mid i \in S_j, \lambda)\, P(i \in S_j \mid \lambda)\, P(\lambda) \qquad (6)$$


or  by  maximizing  over  all  possible  collections  of  sets  {Sj}. 
In  either  case,  the  sets  Sj  are  not  known,  so  the  segmentation 
problem  is  naturally  broken  down  into  two  components:  One 
is  to  determine  the  sets  Sj,  the  other  is  to  determine  the  class 
labels  within  each  of  them: 

Given a training set of labeled samples $\{y_i, \lambda_i\}_{i=1}^N$,

Find a collection of sets $\{S_j\}_{j=1}^N$ such that $S_j \subset D$ and
$D \subset \cup_j S_j$, that are “as informative as possible” for
the purpose of determining the class label $\lambda$.

If the sets are “sufficiently informative” of $\Omega$, perform
the classification; that is, determine the label $\lambda$ within
these sets.
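The marginalization in (6) can be sketched numerically as follows. This is a minimal example under assumed inputs: the nested lists `likelihoods`, `membership`, and `prior`, and all the numbers in them, are hypothetical and stand in for quantities that would have to be estimated from training data.

```python
def posterior(y, likelihoods, membership, prior):
    """Posterior p(lambda | y) following Eq. (6), normalized over lambda in {0, 1}.

    likelihoods[j][lam][y] : p(y | i in S_j, lambda = lam)
    membership[j][lam]     : P(i in S_j | lambda = lam)
    prior[lam]             : P(lambda = lam)
    """
    unnorm = []
    for lam in (0, 1):
        # Sum over the sets S_j, as in Eq. (6).
        total = sum(
            likelihoods[j][lam][y] * membership[j][lam]
            for j in range(len(likelihoods))
        )
        unnorm.append(total * prior[lam])
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical numbers for two sets S_1, S_2 and a binary feature y.
likelihoods = [
    [[0.9, 0.1], [0.1, 0.9]],  # S_1: negatives mostly y=0, positives mostly y=1
    [[0.5, 0.5], [0.5, 0.5]],  # S_2: uninformative about the label
]
membership = [[0.5, 0.5], [0.5, 0.5]]
prior = [0.5, 0.5]

post = posterior(1, likelihoods, membership, prior)
print(post)
```

In this example only the first set contributes evidence about the label; the second set dilutes it, which is why the posterior is less peaked than the likelihoods within the first set alone.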

The key condition translates to the restricted likelihoods
$p(y_i \mid i \in S_j, \lambda = 1)$ and $p(y_i \mid i \in S_j, \lambda = 0)$ being “as different
as possible” in the sense of relative entropy (information
divergence, or Kullback-Leibler divergence). When they are
sufficiently different, the set is sufficiently informative of
$\Omega$, and classification can be easily performed by comparing
likelihood or posterior ratios.


² Indeed, even this condition can be relaxed to assuming that these regions
cover the boundary of $\Omega$, $\cup_j S_j \supset \partial\Omega$, by making suitable assumptions on
the prior $p(\lambda \mid x)$.


This problem relates to active learning, in the sense that the
classifier has to select, among all possible subsets, the ones
that are informative in the sense of enabling the classification
of $\lambda$. A possible approach would be to select $S_j$ at random.
However, an active learner would want to choose, among all
possible $S_j$, the ones that are most informative towards solving
the original classification problem, that is, to determine $\lambda$. It
also relates to semi-supervised learning with model selection,
since, in addition to determining the discrete variable $\lambda$ for
which supervision is provided via the training set, one has
to determine the sets $S_j$, which can be interpreted as groupings,
or collections, or subsets of the training data. However, no
supervision is given as to which point $x \in D$ belongs to which
group $S_j$. In addition, the number of such regions $N$ is not
known and has to be inferred (model selection). This problem
also  touches  on  the  issue  of  generative/discriminative  models, 
since  the  groups  Sj  can  be  interpreted  as  generative  (latent 
mixture  model),  while  the  ultimate  goal  is  classification. 

Information  Forests  implement  the  program  above  using  the 
machinery  of  boosting  and  decision  trees,  as  we  describe  next. 

II.  Derivation  of  Information  Forests 

Information Forests are a family of classifiers that accomplish
the goals described in the previous sections using the
tools of randomized trees.

The groups (“clusters”, or “regions”) $S_j \subset D$ are chosen
within a class $\mathcal{S}$ defined by a family of simple classifiers
(decision stumps). For convenience, we expand the index $j$
into two indices, one relating to the “features” $f_j$ and one
relating to a threshold $\theta_k$. We then define, for a continuous
location parameter $x$,

$$S_{jk} = \{x \in D \mid f_j(x, y) > \theta_k\} \qquad (7)$$

where the feature $f : D \times Y \to \mathbb{R}$; $(x, y) \mapsto f(x, y)$ is
any scalar-valued statistic and the threshold $\theta \in \mathbb{R}$ is chosen
within a finite set. We call the set of features $\mathcal{F} = \{f_j\}$ and
the set of thresholds $\Theta = \{\theta_k\}$. The complement of $S_{jk}$ in $D$
is indicated with $S_{jk}^c = \{x \in D \mid f_j(x, y) < \theta_k\} = D \setminus S_{jk}$.
In the simplest case, for a grayscale image, we could have
$f(x, y) = y(x)$ where $y(x)$ is the intensity value at pixel $x$.
More generally, $f$ can be any (scalar) function of $y$ in a
neighborhood of $x$. For the discrete case, where $i$ is identified
with the location $x_i$, with an abuse of notation we write

$$S_{jk} = \{i \in \Lambda \mid f_j(y_i) > \theta_k\} \qquad (8)$$

and again $S_{jk}^c = \{i \in \Lambda \mid f_j(y_i) < \theta_k\}$. Here the features $f$
are $f : \Lambda \times Y \to \mathbb{R}$; $(i, y) \mapsto f(y_i)$. Specifying the feature
and threshold $(f_j, \theta_k)$ is equivalent to specifying the set $S_{jk}$
and its complement $S_{jk}^c$.
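The correspondence between a (feature, threshold) pair and a pair of index sets can be sketched as below. The function name `stump_split` and the sample values are hypothetical; as an assumption of this sketch, ties (feature equal to the threshold) are sent to the complement so that the two sets always partition the indices.

```python
def stump_split(feature_values, theta):
    """Return (S, S^c): indices with f(y_i) > theta, and the remaining indices."""
    inside = [i for i, f in enumerate(feature_values) if f > theta]
    outside = [i for i, f in enumerate(feature_values) if f <= theta]
    return inside, outside

# Hypothetical scalar feature values f(y_i) for four samples.
values = [0.2, 0.7, 0.5, 0.9]
S, S_c = stump_split(values, 0.5)
print(S, S_c)
```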

We are interested in building informative sets using recursive
binary partitions, so at each stage we only select one
pair $\{S_{jk}, S_{jk}^c\}$. Among all features in $\mathcal{F}$ and thresholds in
$\Theta$, Information Forests choose the one that makes the set $S_{jk}$
“as informative as possible” for the purpose of classification.
From (5) it can be seen that the quantity that measures the
“information content” of a set $S_{jk}$ (or a feature $f_j, \theta_k$) for the
purpose of classification is the Information Divergence (Relative
Entropy, or Kullback-Leibler Divergence) between the
distributions $p(y_i \mid i \in S_{jk}, \lambda_i = 1)$ and $p(y_i \mid i \in S_{jk}, \lambda_i = 0)$.
In shorthand, we write $p(y_i \mid \cdots, \lambda_i = 1)$ as $p_1(y_i \mid \cdots)$ and
$p(y_i \mid \cdots, \lambda_i = 0)$ as $p_0(y_i \mid \cdots)$, and

$$\mathrm{KL}(f_j, \theta_k) = \frac{|S|}{|D|}\, \mathrm{KL}\big(p_1(y_i \mid i \in S) \,\|\, p_0(y_i \mid i \in S)\big) + \frac{|S^c|}{|D|}\, \mathrm{KL}\big(p_1(y_i \mid i \in S^c) \,\|\, p_0(y_i \mid i \in S^c)\big). \qquad (9)$$

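From the characterization of the sets $S_{jk}$, $i \in S_{jk}$ is equivalent
to $f_j(y_i) > \theta_k$, so we write $S_{jk} = S(f_j, \theta_k)$. Therefore,
a decision stump (“KL-node”) chooses among features and
thresholds one (of the possibly many) that attains the maximum:

$$\hat f_j, \hat\theta_k = \arg\max_{f_j, \theta_k} \frac{|S(f_j, \theta_k)|}{|D|}\, \mathrm{KL}\big(p_1(y_i \mid f_j > \theta_k) \,\|\, p_0(y_i \mid f_j > \theta_k)\big) + \frac{|S^c(f_j, \theta_k)|}{|D|}\, \mathrm{KL}\big(p_1(y_i \mid f_j < \theta_k) \,\|\, p_0(y_i \mid f_j < \theta_k)\big). \qquad (10)$$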


Here $\mathrm{KL}(p \| q) = E_p\left[\ln \frac{p}{q}\right] = \int \ln \frac{p}{q}\, dP$ denotes the
Kullback-Leibler divergence.³ The normalization factors
$|S|/|D|$ and $|S^c|/|D|$ count the cardinality of the set $S$ and
its complement relative to the size of the domain $D$.
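The exhaustive search in (10) can be sketched for a small discrete problem. This is an illustration under stated assumptions rather than the authors' implementation: samples are hypothetical `(x, y, lam)` triples, the class-conditional distributions of `y` are estimated by smoothed histograms (the smoothing constant `alpha` is an assumption introduced to avoid zero counts), and ties at the threshold go to the complement.

```python
import math

def histogram(ys, bins, alpha=0.5):
    """Smoothed empirical distribution of discrete values ys over `bins`."""
    total = len(ys) + alpha * len(bins)
    return [(sum(1 for y in ys if y == b) + alpha) / total for b in bins]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def class_divergence(subset, bins):
    """Estimate KL(p_1(y | S) || p_0(y | S)) on a list of (x, y, lam) samples."""
    pos = [y for _, y, lam in subset if lam == 1]
    neg = [y for _, y, lam in subset if lam == 0]
    return kl(histogram(pos, bins), histogram(neg, bins))

def weighted_kl(samples, feature, theta, bins):
    """Objective of Eq. (10) for one (feature, threshold) pair."""
    inside = [s for s in samples if feature(s) > theta]
    outside = [s for s in samples if feature(s) <= theta]
    n = len(samples)
    return ((len(inside) / n) * class_divergence(inside, bins)
            + (len(outside) / n) * class_divergence(outside, bins))

def best_kl_stump(samples, features, thetas, bins):
    """Exhaustive arg max over the finite sets of features and thresholds."""
    candidates = [(j, t) for j in range(len(features)) for t in thetas]
    return max(candidates,
               key=lambda c: weighted_kl(samples, features[c[0]], c[1], bins))

# Hypothetical data: the feature y separates the classes only within each half.
samples = [
    (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0),
    (2, 0, 1), (3, 0, 1), (2, 1, 0), (3, 1, 0),
]
features = [lambda s: s[0], lambda s: s[1]]   # f_0 = location, f_1 = intensity
best = best_kl_stump(samples, features, [0.5, 1.5, 2.5], bins=(0, 1))
print(best)
```

On this data a split on the feature `y` yields zero divergence on both sides, while the split on location at 1.5 produces two subsets in which the classes are easy to tell apart, so the latter is selected.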

If the divergence value is sufficiently large, $\mathrm{KL}(f_j, \theta_k) > \tau$,
the positive and negative distributions are sufficiently different,
and therefore the classification problem is easily solvable. To
actually solve it, one could use the same decision stumps
(features) $\mathcal{F}$, but now chosen to minimize the entropy of the
distribution of class labels, $p(\lambda_i \mid i \in S_{jk}) = p(\lambda_i \mid f_j > \theta_k)$,
and its complement:

$$H(f_j, \theta_k) = \frac{|S(f_j, \theta_k)|}{|D|}\, H(\lambda_i \mid f_j > \theta_k) + \frac{|S^c(f_j, \theta_k)|}{|D|}\, H(\lambda_i \mid f_j < \theta_k) \qquad (11)$$

where $H(p) = -E_p[\ln p] = -\int \ln p\, dP$ is the entropy of
the distribution $p$. If the quantity (10) is sufficiently large,
$\mathrm{KL}(f_j, \theta_k) > \tau$, (11) can be minimized. If not, the process
can be iterated, and the data further split according to the
same criterion, the maximization of $\mathrm{KL}(f_j, \theta_k)$. The value $\tau$
can therefore be interpreted as measuring the least tolerable
confidence in the classification.
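The weighted label entropy of (11) is the familiar splitting criterion of entropy-based decision trees. A minimal sketch for binary labels follows; the function names and the sample values are hypothetical, entropies are in nats, and ties at the threshold go to the complement.

```python
import math

def label_entropy(labels):
    """Entropy (in nats) of the empirical distribution of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def split_entropy(feature_values, labels, theta):
    """Weighted label entropy of Eq. (11) for the split f > theta."""
    inside = [l for f, l in zip(feature_values, labels) if f > theta]
    outside = [l for f, l in zip(feature_values, labels) if f <= theta]
    n = len(labels)
    return ((len(inside) / n) * label_entropy(inside)
            + (len(outside) / n) * label_entropy(outside))

# Hypothetical scalar feature values and labels.
values = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]

parent = label_entropy(labels)             # entropy before splitting
good = split_entropy(values, labels, 0.5)  # separates the classes exactly
poor = split_entropy(values, labels, 0.15) # leaves one side mixed
print(parent, good, poor)
```

The difference `parent - good` is the information gain of the split; a gain close to zero indicates that further splitting with the available stumps does not help.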


A.  Implementation 

Information  Forests  perform  hierarchical  grouping  (mixture 
modeling)  and  classification  by  recursive  binary  partitioning. 
During training, starting from the entire dataset $\{1, \ldots, N\}$,
each node $S$ is passed through a Divergence Test:

$$\mathrm{KL}\big(p_1(y_i \mid i \in S) \,\|\, p_0(y_i \mid i \in S)\big) > \tau. \qquad (12)$$


³ Several alternative divergence measures can be employed instead of
Kullback-Leibler's, for instance symmetrized versions of it, or the more general
Jeffrey divergence.


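If this condition is satisfied, the node is designated as an H-node
that solves

$$\hat f_j, \hat\theta_k = \arg\min_{f_j, \theta_k} H(f_j, \theta_k). \qquad (13)$$

If the Information Gain is below a minimum threshold $\delta > 0$,

$$H(\lambda_i \mid i \in S) - H(f_j, \theta_k) < \delta, \qquad (14)$$

the node is re-designated as a terminal node (“leaf”) and the
classes are determined via

$$\hat\lambda = \arg\max_{\lambda_i \in \{0, 1\}} p(\lambda_i \mid i \in S). \qquad (15)$$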

If the condition (12) is violated, the two classes are difficult
to separate, so we look to partition the data into new clusters
via a KL-node that solves

$$\hat f_j, \hat\theta_k = \arg\max_{f_j, \theta_k} \mathrm{KL}(f_j, \theta_k). \qquad (16)$$

In either case, so long as the node is not a leaf, the selected
$f_j, \theta_k$ generates two sets, $S(f_j, \theta_k)$ and its complement, where

$$S(f_j, \theta_k) = \{i \in S \mid f_j(y_i) > \theta_k\}. \qquad (17)$$

The two sets $S(f_j, \theta_k)$ and $S^c(f_j, \theta_k)$ are each fed to
one of the two children of the current node as the tree grows.
As in a Random Forest, the process is repeated multiple
times, for random subsets of the data points. During testing,
each datum $y_i$ is run through the cascade of tests $f_j(y_i) > \theta_k$
on multiple trees, and then voting is performed.
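The training procedure for a single tree can be sketched as follows. This is a simplified illustration under several assumptions that are not part of the text: samples are hypothetical `(x, y, lam)` triples with `y` quantized into the bins `BINS`; distributions are smoothed histograms; weights are normalized by the size of the current node rather than $|D|$ (which does not change the selected stump); pure nodes, degenerate splits, and a depth limit `max_depth` produce leaves. Growing several trees on random subsets and voting is omitted.

```python
import math

BINS = (0, 1)  # assumption: the feature y is quantized into two bins

def hist(ys, alpha=0.5):
    """Smoothed histogram over BINS (alpha is an assumed smoothing constant)."""
    total = len(ys) + alpha * len(BINS)
    return [(ys.count(b) + alpha) / total for b in BINS]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def divergence(S):
    """Statistic of the Divergence Test, Eq. (12)."""
    return kl(hist([y for _, y, l in S if l == 1]),
              hist([y for _, y, l in S if l == 0]))

def entropy(S):
    if not S:
        return 0.0
    p = sum(l for _, _, l in S) / len(S)
    return -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)

def split(S, j, theta):
    """Stump on coordinate j of the sample tuple (0 = x, 1 = y)."""
    return ([s for s in S if s[j] > theta], [s for s in S if s[j] <= theta])

def weighted(score, S, j, theta):
    a, b = split(S, j, theta)
    return (len(a) * score(a) + len(b) * score(b)) / len(S)

def grow(S, stumps, tau, delta, depth=0, max_depth=8):
    majority = int(2 * sum(l for _, _, l in S) >= len(S))
    if depth == max_depth or entropy(S) == 0.0:
        return {"leaf": majority}          # assumption: pure nodes become leaves
    if divergence(S) > tau:
        # H-node: minimize the weighted label entropy of Eq. (11).
        j, theta = min(stumps, key=lambda st: weighted(entropy, S, *st))
        if entropy(S) - weighted(entropy, S, j, theta) < delta:
            return {"leaf": majority}      # low information gain, Eq. (14)-(15)
        kind = "H"
    else:
        # KL-node: maximize the weighted divergence of Eq. (9).
        j, theta = max(stumps, key=lambda st: weighted(divergence, S, *st))
        kind = "KL"
    a, b = split(S, j, theta)
    if not a or not b:
        return {"leaf": majority}          # assumption: stop on degenerate splits
    return {"kind": kind, "stump": (j, theta),
            "gt": grow(a, stumps, tau, delta, depth + 1, max_depth),
            "le": grow(b, stumps, tau, delta, depth + 1, max_depth)}

def predict(node, sample):
    while "leaf" not in node:
        j, theta = node["stump"]
        node = node["gt"] if sample[j] > theta else node["le"]
    return node["leaf"]

samples = [(0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0),
           (2, 0, 1), (3, 0, 1), (2, 1, 0), (3, 1, 0)]
stumps = [(0, 0.5), (0, 1.5), (0, 2.5), (1, 0.5)]
tree = grow(samples, stumps, tau=0.5, delta=0.01)
print(tree["kind"], tree["stump"])
```

On this toy data the root fails the Divergence Test and becomes a KL-node that splits on location; both children then pass the test and become H-nodes that split on the feature `y`.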

B.  Approximation  and  lower  bound 

While  testing  consists  of  repeated  scalar  tests  that  have 
trivial  computational  complexity,  training  requires  multiple 
iterations  of  exhaustive  optimization  at  each  node,  where  each 
step entails computing $\mathrm{KL}(f, \theta)$, that is, a relative entropy
between  distributions  in  high-dimensional  space  (the  feature 
space  Y).  Therefore,  efficient  approximations  are  needed. 

One could employ several proxies of relative entropy, including
Fisher scores. Or, one could compute relative entropy
between scalar components (projections) of feature space. We
approximate the Information Divergence with a lower bound

$$\mathrm{KL}\big(p_1(y_i \mid f_j > \theta_k) \,\|\, p_0(y_i \mid f_j > \theta_k)\big) \geq \mathrm{KL}\big(p_1(\Pi(y_i) \mid f_j > \theta_k) \,\|\, p_0(\Pi(y_i) \mid f_j > \theta_k)\big) \qquad (18)$$

where $\Pi(y_i)$ is any 1-D projection of $y_i$. For ease of computation,
we choose $\Pi(y_i) = f(y_i)$ from our feature pool. Since
the previous inequality holds for any $\Pi$, we have

$$\mathrm{KL}\big(p_1(y_i \mid f_j > \theta_k) \,\|\, p_0(y_i \mid f_j > \theta_k)\big) \geq \max_{f \in \mathcal{F}} \mathrm{KL}\big(p_1(f(y_i) \mid f_j > \theta_k) \,\|\, p_0(f(y_i) \mid f_j > \theta_k)\big). \qquad (19)$$
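The lower bound can be checked numerically on a small discrete example. In this sketch the two class-conditional distributions over a two-dimensional binary feature are hypothetical, and the projections are simply the two coordinates; the divergence of each marginal cannot exceed the divergence of the joint.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions stored as dicts with equal keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def project(joint, axis):
    """Marginalize a joint distribution over pairs onto one coordinate."""
    out = {}
    for key, mass in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + mass
    return out

# Hypothetical class-conditional distributions over a 2-D binary feature.
p1 = {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.1}
p0 = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.5}

full = kl(p1, p0)                                   # left-hand side of (18)
bounds = [kl(project(p1, a), project(p0, a)) for a in (0, 1)]
best_bound = max(bounds)                            # right-hand side of (19)
print(full, bounds)
```

Taking the maximum over projections gives the tightest bound available from the pool; here the first coordinate is far more informative than the second, and its divergence comes close to that of the joint distribution.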

This  process  is  repeated  according  to  the  same  schedule  of 
conventional  Random  Forests. 

C. Analysis

Information Forests are a superset of Random Forests, as
the former reduce to the latter when $\tau = 0$ is chosen. While
it has been argued [1] that RFs produce balanced trees, this is
true only when the class $\mathcal{F}$ is infinite. In practice, $\mathcal{F}$ is always
finite, and typically RFs produce heavily unbalanced trees, as
the  example  in  Fig.  1  illustrates.  That  example  also  shows 
that,  when  the  dataset  is  not  separable  by  the  class  of  decision 
stumps,  IFs  produce  more  balanced  and  shallower  trees  when 
the  set  of  classifiers  is  restricted. 

More  thorough  analysis  of  the  properties  of  IFs  and  the  class 
of  problems  they  are  well  matched  to  solve  is  forthcoming. 

III.  Discussion 

Random Forests, as a boosting variety of randomized decision
trees, have been employed with a variety of splitting
criteria, mostly related to entropy of the label distributions or
mutual information between the features and the labels [5],
[6], [2]. Breiman analyzes some of the properties of entropy
and compares it with the Gini index in [1]. However, to
the  best  of  our  knowledge,  all  of  these  approaches  choose 
discriminative  splitting  criteria,  where  the  goal  is  to  produce 
partitions  that  are  as  pure  as  possible  at  each  node,  and  there 
is  no  differentiation  between  leaf  nodes  and  non-leaf  nodes. 

Several  choices  of  decision  stumps  have  also  been  applied, 
mostly  depending  on  the  application,  with  the  simplest  choices 
consisting of linear classifiers [3]. We have used simple
linear  scalar  stumps  for  simplicity,  but  there  is  nothing  in 
the  derivation  of  IFs  that  precludes  the  use  of  more  complex 
classifiers  (other  than  computational  considerations). 

Since our approach mixes divergence measures and classification
measures, the analysis of Nguyen et al. [4] could shed
some light on the properties of the scheme proposed.

In forthcoming work, we intend to characterize the performance
of IFs both empirically and analytically.